Update worldstate state change behavior in case of FCU #5699
Conversation
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
I conducted a test on mainnet using this PR and version 23.4.4. During the experiment, I halted Prysm on both instances for two hours. The node running this PR recovered successfully, whereas the node running 23.4.4 saw a continuous decline in performance due to the lack of worldstate updates and eventually crashed. I also tested different consensus clients with different scenarios.
Also, setting a finalized block to the last finalized block on newPayload calls is a bit confusing.
Signed-off-by: ahamlat <ameziane.hamlat@consensys.net>
Signed-off-by: Ameziane H <ameziane.hamlat@consensys.net>
This looks good to me, but we should call out explicitly that we are using backward sync as an asynchronous way of moving head. We would probably benefit from having an explicit behavior for this whether it is part of backward sync service or not. As it is, when we encounter this condition, I believe we are going to unnecessarily download blocks we already have in blockchain storage.
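For illustration, the kind of short-circuit described above might look like the sketch below. Everything here is hypothetical: `Blockchain`, `Block`, `Hash`, and `downloadFromPeers` are stand-ins, not Besu's actual backward sync API.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only: reuse blocks already in local storage instead of
// re-downloading them from peers when backward sync is used to move the head.
class BackwardSyncSketch {
  interface Blockchain { Optional<Block> getBlockByHash(Hash hash); }
  record Block(Hash hash) {}
  record Hash(String hex) {}

  private final Blockchain blockchain;

  BackwardSyncSketch(final Blockchain blockchain) {
    this.blockchain = blockchain;
  }

  CompletableFuture<Block> fetchBlock(final Hash blockHash) {
    // If the block is already in blockchain storage, skip the network fetch.
    return blockchain
        .getBlockByHash(blockHash)
        .map(CompletableFuture::completedFuture)
        .orElseGet(() -> downloadFromPeers(blockHash));
  }

  private CompletableFuture<Block> downloadFromPeers(final Hash blockHash) {
    // Placeholder for the real peer download path.
    return CompletableFuture.failedFuture(
        new UnsupportedOperationException("network fetch not shown"));
  }
}
```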
```diff
-      if (maybeHeadHeader.isPresent()) {
+      if (maybeHeadHeader.isPresent()
+          && Math.abs(maybeHeadHeader.get().getNumber() - chainHead) < 500) {
         LOG.atDebug()
```
this is only necessary for bonsai, perhaps we can defer to the WorldStateArchive to provide this limit? eventually an archive-friendly version of bonsai would not need this also
if we use backward sync mechanism as an asynchronous worldstate move, we should probably make backward sync aware of the fact that we already have those blocks. That is an optimization though and we can do that in a subsequent PR.
I removed this modification and reduced the scope of this PR. Normally it will be enough to fix the problem (testing now).
```diff
-      if (maybeHeadHeader.isPresent()) {
+      if (maybeHeadHeader.isPresent()
+          && Math.abs(maybeHeadHeader.get().getNumber() - chainHead) < 500) {
         LOG.atDebug()
```
Can we replace the number here with a named constant to describe where it comes from?
Good catch. Indeed, I used this number for the tests, but we should definitely use the max layers value that we have for Bonsai.
Line 42 in ec6a6b1:

```java
public static final long RETAINED_LAYERS = 512; // at least 256 + typical rollbacks
```
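For illustration, the distance check from the hunk above could read the named constant instead of the literal 500. A minimal sketch, assuming the check lives next to a constant mirroring `RETAINED_LAYERS` (method and class names here are illustrative):

```java
// Sketch: the magic number 500 replaced by the Bonsai retained-layers constant,
// so the origin of the limit is self-documenting.
class WorldStateMoveCheck {
  // Mirrors the constant shown above; the real value is configurable.
  static final long RETAINED_LAYERS = 512; // at least 256 + typical rollbacks

  // Only move the worldstate directly when the target is close enough
  // that the required trie logs are still retained.
  static boolean isWithinRetainedLayers(final long headerNumber, final long chainHead) {
    return Math.abs(headerNumber - chainHead) < RETAINED_LAYERS;
  }
}
```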
isn't that configurable though?
Yes, there is a flag to change this value.
if we defer to a WorldStateArchive method that provides this limit, we can skip the behavior for forest since it should not have a worldstate move issue.
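A sketch of what that deferral could look like. The `getWorldStateMoveLimit` method and these class names are hypothetical, not part of Besu's actual `WorldStateArchive` interface:

```java
import java.util.Optional;

// Hypothetical extension of the archive interface: Bonsai reports its
// retained-layers limit, while forest reports no limit (no move issue).
interface WorldStateArchiveSketch {
  Optional<Long> getWorldStateMoveLimit();
}

class BonsaiArchiveSketch implements WorldStateArchiveSketch {
  static final long RETAINED_LAYERS = 512;

  @Override
  public Optional<Long> getWorldStateMoveLimit() {
    return Optional.of(RETAINED_LAYERS); // bounded by retained trie logs
  }
}

class ForestArchiveSketch implements WorldStateArchiveSketch {
  @Override
  public Optional<Long> getWorldStateMoveLimit() {
    return Optional.empty(); // forest keeps every state, so no limit is needed
  }
}
```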
I removed this part. I will just keep the worldstate head update modification: https://github.com/hyperledger/besu/pull/5699/files#diff-6da2d65dd7a0564527e8ce3ad7c4482f599fb2af30e37316c47854e5aed2e977R610. I'm running a new test now with this latest version.
Signed-off-by: Ameziane H <ameziane.hamlat@consensys.net>
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
I conducted the same test as @ahamlat on mainnet using this PR (after my last modification).
This makes much more sense than the prior behavior. I don't think we should have been doing a chain reorg without moving the worldstate.
🚢
```diff
   }

-  private boolean forwardWorldStateTo(final BlockHeader newHead) {
+  private boolean moveWorldStateTo(final BlockHeader newHead) {
```
This rename implies that we could use this method to move the ws "backwards", is that correct?
Yes, we can do rollback and rollforward with this code.
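Conceptually, the move works in both directions because Bonsai trie logs record state diffs that can be applied or reversed. A minimal sketch of the idea, with a hypothetical `TrieLog` interface (not Besu's actual implementation):

```java
import java.util.List;

// Conceptual sketch of a bidirectional worldstate move: roll back from the
// current head to the common ancestor, then roll forward to the new head.
class WorldStateMoveSketch {
  interface TrieLog {
    void apply(); // replay the state diff (rollforward)
    void undo();  // reverse the state diff (rollback)
  }

  void moveWorldStateTo(final List<TrieLog> rollbacks, final List<TrieLog> rollforwards) {
    // Undo diffs from the current head back to the common ancestor...
    rollbacks.forEach(TrieLog::undo);
    // ...then apply diffs from the ancestor up to the new head.
    rollforwards.forEach(TrieLog::apply);
  }
}
```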
Update the worldstate in the same way as the blockchain, in order to avoid inconsistencies between the two that would later trigger a big rollforward.

Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
Co-authored-by: Ameziane H <ameziane.hamlat@consensys.net>
Signed-off-by: garyschulte <garyschulte@gmail.com>
PR description
We have identified a potential issue with the FCU mechanism. Based on the logs, Besu considers each new head a reorganization (reorg) because of this line https://github.com/hyperledger/besu/blob/main/consensus/merge/src/main/java/org/hyperledger/besu/consensus/merge/blockcreation/MergeCoordinator.java#L610: we do not update the worldstate unless the new head is a direct descendant of the current head, which is rare in general. As a result, we enter the "rewindToBlock" method https://github.com/hyperledger/besu/blob/main/consensus/merge/src/main/java/org/hyperledger/besu/consensus/merge/blockcreation/MergeCoordinator.java#L633, which performs a chain reorg (in this case, a forward move of the chain) without changing the world state. With Prysm we only ever hit this case, so the chain keeps advancing without the world state advancing. Consequently, each "engineNewPayload" call has to apply a larger number of trie logs to process the new block, which eventually leads to a node crash.
Upon examining the code, it is indeed not normal to have a scenario where we change the blockchain without changing the world state.
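To make the fixed invariant concrete, here is a hedged sketch of the head-update decision. `BlockHeader` and the helper methods are illustrative stand-ins, not Besu's actual `MergeCoordinator` API; the real change is at the line linked above.

```java
// Illustrative sketch of the fixed invariant: the worldstate must move
// whenever the chain moves, including on the non-descendant (reorg) path.
class FcuHeadUpdateSketch {
  record BlockHeader(long number, String hash, String parentHash) {}

  void updateHead(final BlockHeader newHead, final BlockHeader currentHead) {
    if (newHead.parentHash().equals(currentHead.hash())) {
      forwardChainAndWorldState(newHead); // common fast path
    } else {
      // Previously the non-descendant path only rewound the chain, so each
      // engineNewPayload had to replay ever more trie logs; now the
      // worldstate is moved together with the chain.
      rewindChain(newHead);
      moveWorldStateTo(newHead);
    }
  }

  private void forwardChainAndWorldState(final BlockHeader head) { /* elided */ }
  private void rewindChain(final BlockHeader head) { /* elided */ }
  private void moveWorldStateTo(final BlockHeader head) { /* elided */ }
}
```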
Fixed Issue(s)